Big Data Technologies: A Case Study
Padmavathi Vanka, Prof. T. Sudha
Department of Computer Science, SPMVV, Tirupathi, India.
*Corresponding Author E-mail: vanka.padmavathi@gmail.com, thatimakula_sudha@yahoo.com
ABSTRACT:
The development of Big Data is rapidly accelerating and affecting all areas of technology, increasing the benefits for individuals and organizations. Big Data can be characterized by its volume, variety and velocity. Because the data is so large, sophisticated techniques, tools and architectures are required to analyze it. To extract knowledge from Big Data, various models, programs, software, hardware and technologies have been designed and proposed, all aiming to ensure more accurate and reliable results for Big Data applications. In practice, many parameters should be considered: technological compatibility, deployment complexity, cost, efficiency, performance, reliability, support and security risks. This paper is a case study that reviews recent technologies developed for Big Data. It aims to help readers select and adopt the right combination of Big Data technologies according to their technological needs and specific applications' requirements.
KEYWORDS: Big Data, deployment complexity, volume, variety, velocity
INTRODUCTION:
We are living in the era of Big Data. Today, vast amounts of data are generated everywhere owing to advances in Internet and communication technologies and to people's use of smartphones, social media, the Internet of Things, sensor devices, online services and many more. Similarly, with improvements in data applications and the wide distribution of software, government and commercial organizations such as financial institutions, healthcare organizations, education and research departments, energy sectors, retail sectors, life sciences and environmental departments all produce large amounts of data every day. These huge datasets include not only structured data: more than 75% of the data is raw, semi-structured or unstructured. This massive amount of data in different formats can be considered Big Data.
Fig: View of Big Data
Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it involves many areas of business and technology, covering the data produced by different devices and applications. Some of the fields that come under the umbrella of Big Data are black-box data, a component of helicopters, airplanes, jets and so on, which captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft; and social media data, such as the information and views posted by millions of people across the globe on Facebook and Twitter. Thus Big Data includes huge volume, high velocity and an extensible variety of data. The data can be of three types: structured data, which includes relational data; semi-structured data, which includes XML data; and unstructured data, which includes Word, PDF, text and media logs.
Hadoop Ecosystem:
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop has several layers, namely the storage layer, the processing/computation layer, the querying layer, the access layer, and the analytics and management layer.
Storage Layer:
File systems that manage storage across a network of computers are called distributed file systems. Hadoop has its own distributed file system, HDFS (Hadoop Distributed File System), which is designed for storing massive amounts of data across a cluster of nodes. HDFS is mainly designed for batch processing rather than interactive use. When a dataset is copied to HDFS, HDFS divides it into blocks of equal size, makes multiple replicas of each block and distributes them throughout a cluster of nodes (DataNodes) as independent units. The replicas of a block are stored on different nodes and racks. The default size of a data block is 64 MB and the default number of replicas is 3; however, a user can change the block size and replica count in the HDFS configuration file. The multiple replicas provide fault tolerance and availability of data.
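The arithmetic behind blocking and replication can be made concrete with a small sketch. The snippet below (plain Java, not the HDFS API) uses the defaults quoted above, 64 MB blocks and 3 replicas, to compute how many blocks a file occupies and how much raw cluster storage its replicas consume:

```java
// Sketch of how HDFS splits a file into fixed-size blocks and replicates
// each one; the numbers follow the defaults quoted in the text
// (64 MB blocks, replication factor 3).
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default
    static final int REPLICATION = 3;                 // default replica count

    // Number of blocks needed for a file (the last block may be partial).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total raw cluster storage consumed once every block is replicated.
    static long rawStorage(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb)); // 16 blocks of 64 MB
        System.out.println(rawStorage(oneGb)); // 3 GB of raw storage
    }
}
```

A 1 GB file therefore occupies 16 blocks and, with the default replication factor, 3 GB of raw storage spread across the cluster.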
HBase is a scalable, distributed, column-oriented table designed on top of HDFS. While Hadoop MapReduce does not support random access to data, Hadoop applications can access massive datasets in real time with random read/write access via HBase. HBase is typically used to store a large number (billions) of web pages as a WebTable, and MapReduce programs are executed against the WebTable to retrieve information. The WebTable is accessed randomly and in real time as users click on a website. HBase automatically divides the table horizontally into multiple regions and distributes them across region server machines, each region holding a subsection of the table. Initially there is one region; however, when the table grows and reaches a configurable threshold, the system automatically partitions it row-wise into two equal regions.
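The region-splitting behavior can be sketched as a toy model (this is not HBase code): a region covers a contiguous row-key range, and once its size crosses a configurable threshold it is split into two half-ranges. The 10 GB threshold and the midpoint key below are invented for illustration:

```java
import java.util.*;

// Toy model of HBase's automatic region splitting: a region holds a
// contiguous row-key range, and once its size crosses a configurable
// threshold it is split at a midpoint key into two equal halves.
public class RegionSplit {
    static final long SPLIT_THRESHOLD = 10L * 1024 * 1024 * 1024; // illustrative 10 GB

    static class Region {
        final String startKey, endKey;
        final long sizeBytes;
        Region(String startKey, String endKey, long sizeBytes) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns the region unchanged while it is small enough, otherwise
    // two regions of half the size, split at the supplied midpoint key.
    static List<Region> maybeSplit(Region r, String midKey) {
        if (r.sizeBytes <= SPLIT_THRESHOLD) return List.of(r);
        return List.of(new Region(r.startKey, midKey, r.sizeBytes / 2),
                       new Region(midKey, r.endKey, r.sizeBytes / 2));
    }
}
```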
Comparison between HDFS and HBase features:

| Properties | HDFS | HBase |
| --- | --- | --- |
| System | HDFS is a distributed file system appropriate for storing large files. | HBase is a distributed non-relational database built on top of HDFS. |
| Query and search performance | HDFS is not a general-purpose file system; it does not provide fast record lookup in files. | HBase enables fast record lookups (and updates) for large tables. |
| Storage | HDFS stores large files (gigabytes to terabytes in size) across Hadoop servers. | HBase internally puts the data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. |
| Processing | HDFS is suitable for high-latency batch processing. | HBase is built for low-latency operations. |
| Access | Data is primarily accessed through MapReduce. | HBase provides access to single rows from billions of records. |
| Input-output operations | HDFS is designed for batch processing and hence does not support random read/write operations. | HBase enables random read/write operations; data is accessed through shell commands and client APIs in Java, REST and Avro. |
Processing Layer:
MapReduce and YARN constitute two options for carrying out data processing on Hadoop. They are designed to manage job scheduling, resources and the cluster; it is worth noticing that YARN is more generic than MapReduce. MapReduce has become the most widely adopted computing framework for Big Data analytics. In the MapReduce programming model, the computation is specified in the form of a map function and a reduce function. The map function processes a block of the dataset as (key, value) pairs and produces its output as a list of (key, value) pairs. The intermediate values are grouped together by key and then passed to the reduce function. The reduce function takes an intermediate key along with its associated values and processes them to produce a new list of values as the final output. MapReduce is a highly scalable computing model that enables thousands of inexpensive commodity computers to be used as an effective platform for distributed and parallel computing.
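The map/group/reduce flow described above can be simulated in a few lines of plain Java (this sketch runs in memory and does not use the Hadoop API), using word count, the customary MapReduce example:

```java
import java.util.*;

// In-memory sketch of the MapReduce flow: map emits (key, value) pairs,
// the framework groups values by key (the shuffle), and reduce folds each
// group of values into a final result.
public class MiniMapReduce {
    // Map phase: one input line -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce phase: (word, [1, 1, ...]) -> sum of the counts.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Driver: map every line, shuffle (group by key), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data big", "data tools")));
        // {big=2, data=2, tools=1}
    }
}
```

In a real Hadoop job the same map and reduce functions would run on separate machines, with the framework handling the shuffle across the network.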
YARN is more generic than MapReduce. It provides better scalability, enhanced parallelism and advanced resource management in comparison to MapReduce, and it offers operating-system-like functions for Big Data analytical applications. The Hadoop architecture has been changed to incorporate the YARN Resource Manager. In general, YARN works on top of HDFS and can handle both batch processing and real-time interactive processing. YARN is compatible with the Application Programming Interface (API) of MapReduce: users merely have to recompile MapReduce jobs in order to run them on YARN.
Querying Layer:
Pig is a scripting (data-flow) language developed by Yahoo for analyzing large datasets in parallel through a language called Pig Latin. The Pig compiler automatically converts a Pig script into a series of MapReduce programs so that it can be executed on a Hadoop cluster. A Pig script can run on a single machine inside a JVM or be executed on a cluster of nodes. In a Pig script, operations such as filtering, grouping and joining can be expressed, and they can be extended with user-defined functions. Pig is basically developed for batch processing and is not appropriate for all types of data processing applications: if a user query touches only a small portion of a huge dataset, Pig will not perform well, because it will scan the whole dataset or at least a large portion of it.
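As a sketch of the data-flow style, the hypothetical Pig Latin script below loads a log file, filters it, and groups and aggregates it; the file paths and field names are invented for illustration:

```pig
-- Hypothetical paths and field names, for illustration only.
logs    = LOAD 'hdfs:///data/access_log'
          AS (user:chararray, url:chararray, bytes:long);
big     = FILTER logs BY bytes > 1024;        -- filtering
by_user = GROUP big BY user;                  -- grouping
totals  = FOREACH by_user
          GENERATE group AS user, SUM(big.bytes) AS total;
STORE totals INTO 'hdfs:///out/user_totals';
```

The Pig compiler would translate each of these relational steps into one or more MapReduce stages before running them on the cluster.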
JAQL is a declarative language on top of Hadoop that provides a query language and supports massive data processing. It converts high-level queries into MapReduce jobs. It was designed to query semi-structured data in JSON (JavaScript Object Notation) format, so JAQL, like Pig, does not require a data schema. JAQL provides several built-in functions, core operators and I/O adapters, which support processing, storing and translating data as well as converting it into JSON format.
Hive is a query language for data warehousing on top of Hadoop. It was designed by Facebook to execute SQL-like statements on the large datasets generated by Facebook every day and stored in HDFS. Hive interacts with datasets via HiveQL, a Hive query language based on SQL. Hive can be placed between Pig and a traditional RDBMS: like an RDBMS, Hive uses relations (tables) with a schema to store the dataset, and, similar to Pig, Hive uses distributed storage (HDFS) to store the tables. Users who are familiar with map/reduce programming can express their logic as map and reduce functions and plug them into Hive if it is difficult for them to express that logic in HiveQL.
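A short HiveQL sketch illustrates the SQL-like style; the table and column names are hypothetical:

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE page_views (user_id BIGINT, url STRING, view_time TIMESTAMP)
    STORED AS TEXTFILE;

-- Top ten most-viewed pages; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```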
Hive, Pig and JAQL features:

| Properties | Hive | Pig | JAQL |
| --- | --- | --- | --- |
| Language | HiveQL (SQL-like) | Pig Latin (script-based language) | JAQL |
| Type of language | Declarative (SQL dialect) | Data flow | Data flow |
| Data structures | Suited for structured data | Scalar and complex data types | File-based data |
| Schema | Table metadata stored in the database | Schema optionally defined at runtime | Schema is optional |
| Data access | JDBC, ODBC | PigServer | JAQL web server |
| Developer | Facebook | Yahoo | IBM |
Access Layer:
Apache Sqoop:
Hadoop processes a vast dataset when it is stored in HDFS; if the dataset is stored outside HDFS, e.g. in a relational database, a Hadoop program needs to employ external APIs. Sqoop is an open-source tool that enables users to efficiently fetch huge datasets from a relational database into Hadoop for onward processing. In addition, Sqoop can transfer data from a relational database system into HBase. It currently works with relational databases including MySQL, SQL Server, Oracle, DB2 and PostgreSQL.
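A typical Sqoop import invocation might look like the sketch below; the connection string, credentials and table name are placeholders, while the flags shown are standard Sqoop options:

```shell
# Placeholders: dbhost, sales, etl_user and orders are illustrative names.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

Sqoop turns this command into a parallel MapReduce job (here, four map tasks) that reads rows from the source table and writes them into HDFS.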
Apache Flume:
Apache Flume is a highly reliable, distributed service that automatically collects and aggregates large volumes of streaming data from different sources and transfers them into HDFS. It was initially developed to collect streaming data from web logs, but it can now collect datasets from many kinds of sources. The Flume architecture mainly comprises a source, a sink (which delivers the data to HDFS), a channel (a conduit that connects the source and sink) and an agent (the JVM that runs the Flume services).
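A minimal Flume agent configuration tying together the source, channel and sink roles described above might look like the following sketch; the agent name, log path and HDFS path are illustrative:

```properties
# Illustrative agent (a1), paths and settings -- not from the paper.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a web server log via Flume's exec source.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: in-memory conduit between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver events into HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs
a1.sinks.k1.channel = c1
```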
Chukwa is a data collection system built on top of Hadoop whose goal is to monitor large distributed systems. It uses HDFS to collect data from various data providers and MapReduce to analyze the collected data, inheriting Hadoop's scalability and robustness. It provides an interface to display, monitor and analyze results. Chukwa thus offers a flexible and powerful platform for Big Data, enabling analysts to collect and analyze large datasets as well as to monitor and display results. To ensure flexibility, Chukwa is structured as a pipeline of collection and processing stages, with defined interfaces between stages.
Data Analytics:
The Mahout library provides analytical capabilities and multiple optimized algorithms. Mahout is essentially a set of Java libraries, with the benefit of ensuring scalable and efficient implementations of large-scale machine learning algorithms over large datasets. For instance, it offers libraries for clustering (such as K-means, fuzzy K-means and Mean Shift), classification, collaborative filtering (for predictions and comparisons), frequent pattern mining and text mining (for scanning text and assigning contextual data). Additional tools include topic modeling, dimensionality reduction, text vectorization, similarity measures, a math library, and more. Companies that have implemented scalable machine learning algorithms include Google, IBM, Amazon, Yahoo, Twitter and Facebook.
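To give a feel for the K-means algorithm that Mahout scales out, the sketch below implements one iteration of a one-dimensional K-means in plain Java (this is not Mahout's API): each point is assigned to its nearest centroid, and centroids are then recomputed as the mean of their points.

```java
import java.util.Arrays;

// Minimal 1-D K-means sketch: one step assigns every point to its nearest
// centroid, then moves each centroid to the mean of its assigned points.
public class KMeans1D {
    static double[] step(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double p : points) {
            int best = 0; // index of the nearest centroid
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                    best = c;
            sum[best] += p;
            count[best]++;
        }
        double[] next = centroids.clone();
        for (int c = 0; c < centroids.length; c++)
            if (count[c] > 0) next[c] = sum[c] / count[c];
        return next;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centroids = {0, 5};
        for (int i = 0; i < 10; i++) centroids = step(points, centroids);
        System.out.println(Arrays.toString(centroids)); // [2.0, 11.0]
    }
}
```

Mahout's contribution is to run exactly this kind of iteration as a MapReduce job, so the assignment step is distributed across the cluster.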
R is a programming language for statistical computing, machine learning and graphics. R is free, open-source software distributed and maintained by the R project, which relies on a community of users, developers and contributors. The R language includes well-developed, simple and effective functionality, including conditionals, loops, user-defined recursive functions and input/output facilities. Many Big Data distributions (such as Cloudera, Hortonworks and Oracle) use R to perform analytics.
Comparison between Flume and Chukwa:

| Properties | Chukwa | Flume |
| --- | --- | --- |
| Real-time | Acquires data for periodic real-time analysis (within minutes) | Focuses on continuous real-time analysis (within seconds) |
| Architecture | Batch system | Continuous stream processing system |
| Manageability | Distributes information about data flows broadly among its services | Maintains a central list of ongoing data flows, stored redundantly using ZooKeeper |
| Reliability | The agents on each machine decide what data to send; Chukwa uses an end-to-end delivery model that can leverage local on-disk log files for reliability | Robust and fault-tolerant, with tunable reliability, failover and recovery mechanisms; Flume adopts a hop-by-hop model |
Management Layer:
ZooKeeper is a centralized coordination service for distributed applications. It was originally developed by Yahoo and later became part of the Hadoop ecosystem. The services provided by ZooKeeper include configuration management, synchronization, naming and group membership. HBase, Flume and HDFS HA (high availability) all depend on ZooKeeper.
Apache Avro is a framework for modeling and serializing data and for making Remote Procedure Calls (RPC). Avro defines a compact and fast binary data format to support data-intensive applications, and it provides support for this format in a variety of programming languages such as Java, Scala, C, C++ and Python. Avro ensures efficient data compression and storage at the various nodes of Apache Hadoop. Within Hadoop, Avro passes data from one program or language to another.
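Avro schemas are themselves written in JSON. As a sketch, the hypothetical record below defines a `User` with a required id and name and an optional email field:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Records serialized against this schema are written in Avro's compact binary format, and any language with an Avro binding can read them back using the same schema.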
Apache Oozie is a workflow scheduler system designed to run and manage jobs in Hadoop clusters. It is a reliable, extensible and scalable management system that can handle the efficient execution of large volumes of workflows. Workflow jobs take the form of Directed Acyclic Graphs (DAGs). Oozie supports various types of Hadoop jobs, including MapReduce, Pig, Hive, Sqoop and DistCp jobs, and it lets users track the execution of their workflows.
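A skeletal Oozie workflow definition expressing a one-action DAG might look like the sketch below; the workflow and action names are illustrative, and a real map-reduce action would carry additional job configuration:

```xml
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="import-step"/>
  <action name="import-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The `start`, `action`, `kill` and `end` nodes are the edges of the DAG: Oozie follows the `ok` transition on success and the `error` transition on failure.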
CONCLUSION:
In this paper, we have discussed the technologies used in each layer of Big Data platforms. Different technologies and distributions have also been compared in terms of their capabilities, advantages and limits. In spite of the important developments in the Big Data field, our comparison of the various technologies shows that many shortcomings remain; most of the time, they are related to the architectures and techniques adopted for Big Data infrastructure.
Received on 26.09.2017 Modified on 08.11.2017
Accepted on 09.12.2017 ©A&V Publications. All rights reserved.
Research J. Science and Tech. 2017; 9(4): 639-642.
DOI: 10.5958/2349-2988.2017.00109.7